Fix stateless HTTP task accumulation causing memory leak #2145

Open
wiggzz wants to merge 15 commits into modelcontextprotocol:main from wiggzz:fix/stateless-task-group-leak

Conversation


@wiggzz wiggzz commented Feb 25, 2026

Summary

  • Fix memory leak where run_stateless_server tasks accumulate indefinitely in the manager's global task group after client disconnects
  • Use request-scoped task groups instead of the global _task_group for stateless requests, ensuring tasks are cancelled when requests complete
  • Add graceful SSE drain on shutdown: terminate all active transports before cancelling the task group, so SSE streams close cleanly instead of being abruptly reset
  • Add regression tests for both the leak and the graceful drain

Motivation and Context

Related to #1764 (the underlying task accumulation issue in stateless mode).
Fixes #1272 (graceful shutdown / SSE drain).

Root cause — memory leak

When a client disconnects while a tool call is in progress, EventSourceResponse detects the disconnect and tears down the SSE stream pipeline. However, Server.run() — which is processing the tool call in a separate task spawned into the manager's global _task_group — continues running. After the tool handler completes, the response has nowhere to go (the request streams were cleaned up), and Server.run() blocks indefinitely on async for message in session.incoming_messages. These zombie tasks accumulate one per disconnected request.

In production, we observed 449 leaked ServerSession objects, 66,086 orphaned coroutines, 4,463 blocked stream receive operations, and ~150 MB of leaked heap — all from this pattern.
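The zombie-task pattern described above can be reproduced with a small sketch. Stdlib asyncio stands in for the SDK's anyio task group here, and `run_stateless_server` is reduced to its blocking behavior; all names are illustrative, not the SDK's actual code:

```python
import asyncio


async def run_stateless_server(active: set, task_id: int) -> None:
    # Stand-in for Server.run(): after the tool handler completes, it blocks
    # forever on `async for message in session.incoming_messages`, because
    # the disconnected client's request streams were already torn down.
    active.add(task_id)
    try:
        await asyncio.Event().wait()  # a message that will never arrive
    finally:
        active.discard(task_id)


async def demo() -> int:
    active: set = set()
    # Stand-in for the manager's long-lived global _task_group: each
    # disconnected request leaves one task permanently parked in it.
    tasks = [asyncio.create_task(run_stateless_server(active, i)) for i in range(3)]
    await asyncio.sleep(0.05)
    leaked = len(active)  # all three tasks are still blocked: zombies
    for t in tasks:       # only full manager shutdown ever frees them
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return leaked


leaked = asyncio.run(demo())
print(leaked)  # 3 — one leaked task per disconnected request
```

Each request adds one permanently blocked task; nothing in the request lifecycle ever reaps it, which is the accumulation the production numbers reflect.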

Root cause — broken SSE close on shutdown

The memory leak fix moves stateless server tasks from the manager's global task group to request-scoped task groups. This fixes the leak, but it also means the manager's tg.cancel_scope.cancel() during shutdown no longer reaches stateless transports — nothing ever closes their streams, so SSE responses hang indefinitely. A reverse proxy like nginx would log "upstream prematurely closed connection while reading upstream".

The graceful drain fix adds explicit terminate() calls in run()'s finally block (for both stateful and stateless transports) before cancelling the task group. This closes the in-memory streams, letting EventSourceResponse send a final more_body=False chunk — a clean HTTP close.

Why the existing test didn't catch the leak

The existing test (test_stateless_requests_memory_cleanup, added in PR #1140) mocks app.run to return immediately, so the spawned tasks complete naturally. It never exercises the real-world scenario where client disconnects leave app.run() blocked.

The fixes

Memory leak: Replace the global task group spawn with a request-scoped anyio.create_task_group(). Both the server task and the request handler run inside this scoped group. When the request handler finishes (or is cancelled due to client disconnect), the scope is cancelled, which automatically cancels the server task. No tasks leak into the manager's long-lived task group.

Graceful drain: Track in-flight stateless transports in _stateless_transports. In run()'s finally block, call terminate() on all active transports (stateful via _server_instances, stateless via _stateless_transports) before tg.cancel_scope.cancel().
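The drain logic can be sketched as follows, with a hypothetical `FakeTransport` standing in for the real transport class and simplified container types:

```python
import asyncio


class FakeTransport:
    # Minimal stand-in for the streamable-HTTP transport (names hypothetical).
    def __init__(self, name: str, closed: list) -> None:
        self.name = name
        self._closed = closed

    async def terminate(self) -> None:
        # In the real transport this closes the in-memory streams, letting
        # EventSourceResponse emit a final more_body=False chunk.
        self._closed.append(self.name)


async def shutdown(server_instances: dict, stateless_transports: set) -> None:
    # Sketch of run()'s finally block after the fix: drain every active
    # transport, stateful and stateless, before cancelling the task group.
    for transport in list(server_instances.values()):
        await transport.terminate()
    for transport in list(stateless_transports):
        await transport.terminate()
    # ...followed by tg.cancel_scope.cancel() in the real manager.


closed: list = []
stateful = {"session-1": FakeTransport("session-1", closed)}
stateless = {FakeTransport("stateless-1", closed)}
asyncio.run(shutdown(stateful, stateless))
print(sorted(closed))  # ['session-1', 'stateless-1']
```

Ordering is the point: terminating first gives each SSE pipeline a chance to finish its stream cleanly before cancellation tears everything down.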

Testing

  • test_stateless_requests_task_leak_on_client_disconnect: Real Server.run() with a real tool handler, real SSE streaming, simulated client disconnect — verifies zero leaked tasks after 3 disconnected requests
  • test_graceful_shutdown_closes_sse_streams_cleanly: Uses an ASGI send() interceptor to verify that SSE responses receive more_body=False (the ASGI equivalent of a proper HTTP chunked transfer close) during manager shutdown. Without the fix, the test times out because nothing ever closes the stateless transport's streams.
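The send-interceptor technique used by the shutdown test can be sketched like this; the ASGI app below is illustrative, not the SDK's actual transport:

```python
import asyncio


async def app(scope, receive, send):
    # Minimal ASGI app standing in for a streamable-HTTP SSE response.
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/event-stream")],
    })
    await send({"type": "http.response.body", "body": b"data: hi\n\n",
                "more_body": True})
    # A graceful drain ends the stream with a final empty chunk:
    await send({"type": "http.response.body", "body": b"", "more_body": False})


async def demo() -> bool:
    events = []

    async def intercepting_send(message):  # the test's send() interceptor
        events.append(message)

    await app({"type": "http"}, None, intercepting_send)
    # Clean close iff some body message carries more_body=False.
    return any(
        m["type"] == "http.response.body" and not m.get("more_body", True)
        for m in events
    )


closed_cleanly = asyncio.run(demo())
print(closed_cleanly)  # True: the SSE stream closed cleanly
```

Without the fix, the final `more_body=False` message never arrives, so the real test's wait for it times out.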

All 13 tests pass.

Breaking changes

None. This is an internal implementation change. The public API and behavior are unchanged.

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

@wiggzz wiggzz force-pushed the fix/stateless-task-group-leak branch 2 times, most recently from 67d02d3 to a7a340d, on February 25, 2026 at 17:00
@maxisbey maxisbey added the labels "bug" (Something isn't working) and "P2" (Moderate issues affecting some users, edge cases; potentially valuable feature) on Mar 5, 2026
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 6, 2026
Use request-scoped task groups for stateless requests instead of the
manager's global task group.  When a client disconnects while a tool
call is in progress, the request-scoped task group is cancelled,
preventing zombie tasks from accumulating.

In production, this caused 449 leaked ServerSession objects, 66,086
orphaned coroutines, and ~150 MB of leaked heap.

Upstream PR: modelcontextprotocol#2145
Github-Issue: modelcontextprotocol#1764
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 9, 2026
Use request-scoped task groups for stateless requests instead of
the manager's global task group, preventing zombie task accumulation
when clients disconnect mid-request.

Upstream PR: modelcontextprotocol#2145
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 9, 2026
wiggzz and others added 14 commits March 9, 2026 13:21
…tprotocol#756)

In stateless mode, each request spawned a `run_stateless_server` task
into the manager's global `_task_group`. After `handle_request()`
completed and `terminate()` was called, the task continued running
inside `app.run()`, blocked on `async for message in
session.incoming_messages`. These zombie tasks accumulated indefinitely,
leaking memory.

Replace the global task group spawn with a request-scoped task group
so that server tasks are automatically cancelled when their request
completes.

Add a regression test that simulates the real blocking behavior of
`app.run()` using `anyio.sleep_forever()` and verifies no tasks
linger in the global task group after requests finish.

Closes modelcontextprotocol#756

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntextprotocol#1764 (open)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add anyio.sleep(0) between terminate() and cancel_scope.cancel() so
  the message router can observe the ClosedResourceError from the closed
  streams before the task group is cancelled (restores line 1022 coverage)
- Remove handle_list_tools from test — it was never called (only
  call_tool is exercised), so its return was an uncovered line
- Mark the CallToolResult return as pragma: no cover — intentionally
  unreachable in this test because the task is always cancelled at
  await tool_gate.wait() before the return executes
- Collapse multi-line function signatures to satisfy ruff-format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) yield in the stateless request handler allows the event
loop to exercise exception paths that were previously unreachable with
mocked tests. These paths are legitimately reachable and the more
realistic test now covers them consistently:

- streamable_http_manager.py: except Exception in run_stateless_server
- streamable_http.py: except Exception in SSE pipeline (x2)
- streamable_http.py: if not self.mcp_session_id (always true in stateless mode)
- session.py: except Exception in receive loop cleanup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) between terminate() and cancel_scope.cancel() was
non-deterministic: sometimes the event loop ran exception handlers
(covering the pragmas), sometimes the cancel won the race (leaving
lines uncovered). This caused strict-no-cover to flip between
"wrongly marked" and "missing coverage" across CI runs.

Revert to the simple, deterministic approach:
- Restore all original pragma: no cover annotations
- Add pragma: no cover for the ClosedResourceError/_terminated path
  in the message router, which is unreachable with request-scoped
  task groups (Cancelled always wins over ClosedResourceError)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The if-branch check itself is reachable; it's the body that isn't.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cancel the request-scoped task group first so the Cancelled exception
deterministically reaches the server task before terminate() closes the
streams. This avoids a race between Cancelled and ClosedResourceError
in the message router.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change 5 pragma annotations from strict "no cover" to "lax no cover"
since these exception handlers may or may not fire depending on
cancellation timing. Fix ruff-format in test file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark 4 test assertions as lax no cover — coverage.py on Python 3.11
falsely reports them as uncovered despite the test passing. Also fix
ruff format for assert message and async with blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add None check assertion and type: ignore for accessing private
_tasks attribute on TaskGroup (needed to verify no leaked tasks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous approach ran handle_request inside a child task within the
request-scoped task group. This caused coverage.py on Python 3.11 to
lose track of execution when returning from the task group back to the
caller — the root cause of the false coverage gaps on test assertions.

The fix is simpler: run handle_request directly in the async with body
(like the original code on main), with only run_stateless_server as a
child task. This preserves the request-scoped task group for preventing
task accumulation while keeping the caller's execution context intact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Terminate all active transports (stateful and stateless) before
cancelling the task group during shutdown.  This closes in-memory
streams so EventSourceResponse can send a final more_body=False
chunk — a clean HTTP close instead of a connection reset.

Without this, the request-scoped task groups introduced for the
memory leak fix isolate stateless transports from the manager's
cancel scope, so nothing ever closes their streams on shutdown.
@wiggzz wiggzz force-pushed the fix/stateless-task-group-leak branch from 92d4f67 to 3179143 on March 9, 2026 at 18:22
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 10, 2026
Request-scoped task groups for stateless requests prevent zombie task
accumulation when clients disconnect mid-request. Graceful drain on
shutdown terminates all active transports before cancelling the task
group, allowing SSE responses to close cleanly.

Backported to v1.26.0 for compatibility with dbt-mcp and ai-codegen-api.

Upstream PR: modelcontextprotocol#2145


Development

Successfully merging this pull request may close these issues.

Server hangs when shutting down if a connection is still open